import pandas as pd
import numpy as np
from lets_plot import *
# add the additional libraries you need to import for ML here
LetsPlot.setup_html(isolated_frame=True)
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html
# Include and execute your code here
# import your data here using pandas and the URL

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS. THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

A client has requested this analysis, and this is your one shot: what you would say to your boss in a 2-minute elevator ride before he takes your report and hands it to the client.
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
The charts show clear differences between homes built before and after 1980. Older homes tend to have lower net prices and smaller living areas, while newer homes generally have higher values in both features. The distributions show strong separation, meaning net price and living area contain useful predictive information for machine learning.
# Include and execute your code here
# Load data
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)
# Label target variable
df["before1980_label"] = df["before1980"].map({1: "Before 1980", 0: "1980 or newer"})
# Chart 1: Boxplot of Net Price
ggplot(df, aes(x='before1980_label', y='netprice', fill='before1980_label')) + \
geom_boxplot() + \
scale_y_log10() + \
labs(
title="Net Price vs Home Age Category",
x="Home Built",
y="Net Price (log scale)"
) + \
theme_bw()
# --- CHART 2: Livearea Density Plot by Before1980 ----
ggplot(df, aes(x='livearea', color='before1980_label', fill='before1980_label')) + \
geom_density(alpha=0.4) + \
scale_x_log10() + \
labs(
title="Distribution of Home Size (Livearea) by Home Age Category",
x="Live Area (square feet, log scale)",
y="Density"
) + \
theme_bw()

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc.) and describe what other models you tried.
For this task, I tested three classification models — Logistic Regression, Decision Tree, and Random Forest — to predict whether a home was built before 1980. Logistic Regression performed well but is limited because it assumes linear relationships, while the Decision Tree achieved perfect accuracy but is more prone to overfitting the training data. The Random Forest was my final choice because it reached 100% test accuracy, handled complex nonlinear patterns between housing features, and reduced overfitting through averaging. One caveat: the near-perfect scores are driven largely by yrbuilt, which remains in the feature set and directly determines the before1980 label, so these accuracies reflect some target leakage.
# Include and execute your code here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load data
url = "https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv"
df = pd.read_csv(url)
# Clean data: remove leakage variable
df = df.drop(columns=["tasp"], errors="ignore")
# Target
y = df["before1980"]
# Features
X = df.drop(columns=["before1980", "parcel"])
# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)
# Train/test split WITH stratification
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# ----------------------------------------------------
# MODEL 1 — Logistic Regression (scaled data)
# ----------------------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
log_model = LogisticRegression(max_iter=3000)
log_model.fit(X_train_scaled, y_train)
log_pred = log_model.predict(X_test_scaled)
log_acc = accuracy_score(y_test, log_pred)
# ----------------------------------------------------
# MODEL 2 — Decision Tree
# ----------------------------------------------------
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
tree_acc = accuracy_score(y_test, tree_pred)
# ----------------------------------------------------
# MODEL 3 — Random Forest (Final Choice)
# ----------------------------------------------------
rf = RandomForestClassifier(
n_estimators=300,
max_depth=None,
random_state=42
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)
# Display all accuracies
print("Logistic Regression Accuracy:", round(log_acc, 4))
print("Decision Tree Accuracy:", round(tree_acc, 4))
print("Random Forest Accuracy:", round(rf_acc, 4))

Logistic Regression Accuracy: 0.9965
Decision Tree Accuracy: 1.0
Random Forest Accuracy: 1.0
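Since both tree-based models score a perfect 1.0 on the single held-out split, it is worth checking that the result is stable across folds. A minimal k-fold cross-validation sketch (shown on a synthetic dataset so it runs standalone; in the notebook you would pass the real X and y instead of X_demo, y_demo):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features (the real call would use X, y from above)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: each fold is held out once while the rest trains the model
scores = cross_val_score(rf, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", scores.mean().round(3))
```

If the fold accuracies vary widely, the single-split score is not trustworthy; consistently high fold scores on the real data would instead point to leakage from yrbuilt.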
Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.
The Random Forest's feature importances confirm that yrbuilt dominates the model (importance ≈ 0.52), which is expected since before1980 is derived directly from it. The remaining signal comes from structural and sale features: arcstyle_ONE-STORY, stories, numbaths, gartype_Att, and livearea each contribute meaningfully, reflecting that one-story layouts, attached garages, and smaller living areas are more common among pre-1980 homes.
# Include and execute your code here
import pandas as pd
from lets_plot import *
from sklearn.ensemble import RandomForestClassifier
LetsPlot.setup_html()
X_real = df.drop(columns=["before1980", "parcel", "tasp"], errors="ignore")
features = X_real.columns.tolist()
# Train Random Forest
rf_filtered = RandomForestClassifier(
n_estimators=300,
max_depth=None,
random_state=42
)
rf_filtered.fit(X_real, y)
# Build feature importance table
importance_df = pd.DataFrame({
"feature": features,
"importance": rf_filtered.feature_importances_
}).sort_values(by="importance", ascending=False)
importance_df.head(15)
# Plot Feature Importance (lets_plot compatible)
(
ggplot(importance_df.head(15), aes(x='feature', y='importance'))
+ geom_bar(stat="identity", fill="#2C7BB6")
+ coord_flip()
+ labs(
title="Top Feature Importances (Using Real Housing Features)",
x="Feature",
y="Importance Score"
)
+ theme_bw()
)

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.
Model Quality Evaluation
To evaluate my Random Forest classifier, I used accuracy, precision, and recall. The model achieved near-perfect accuracy, meaning almost all predictions matched the true labels. Its precision was extremely high, showing the model rarely labeled newer homes as “before 1980” by mistake. The recall was also very high, meaning the model successfully identified almost all homes that truly were built before 1980. Together, these metrics confirm that the classifier is both reliable and consistent across different types of prediction errors.
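A minimal sketch of how these three metrics are computed with scikit-learn, using small hand-made label lists (illustrative values, not the project's predictions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 1 = built before 1980, 0 = 1980 or newer (illustrative values)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # share of all predictions that are correct
print("Precision:", precision_score(y_true, y_pred))  # of homes labeled "before 1980", share truly old
print("Recall:   ", recall_score(y_true, y_pred))     # of truly old homes, share the model found
```

Here all three come out to 0.75: six of eight predictions match, three of the four "before 1980" calls are right (precision), and three of the four truly old homes are found (recall).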
# Include and execute your code here

Repeat the classification model using 3 different algorithms. Display their Feature Importance and Confusion Matrix. Explain the differences between the models and which one you would recommend to the Client.
All three models performed well, but Logistic Regression made a few errors, and the Decision Tree overfit by relying almost entirely on yrbuilt. The Random Forest achieved perfect accuracy while using multiple features more realistically. Because it is the most stable and generalizable model, the Random Forest is the best choice for the client.
# Include and execute your code here
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# --------------------------------------
# 2. Define models
# --------------------------------------
log_reg = LogisticRegression(max_iter=2000)
tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=300, random_state=42)
models = {
"Logistic Regression": log_reg,
"Decision Tree": tree,
"Random Forest": rf
}
# --------------------------------------
# 3. Train models and create Confusion Matrices
# --------------------------------------
for name, model in models.items():
    print("\n============================")
    print(f"MODEL: {name}")
    print("============================")
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Print Confusion Matrix
    cm = confusion_matrix(y_test, preds)
    print("Confusion Matrix:\n", cm)

    # Display matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap="Blues")
    plt.title(f"{name} - Confusion Matrix")
    plt.show()
# --------------------------------------
# 4. Feature Importance (or Coefficients)
# --------------------------------------
# Logistic Regression coefficients
log_importance = pd.DataFrame({
"feature": X.columns,
"importance": log_reg.coef_[0]
}).sort_values(by="importance", ascending=False)
print("\nLogistic Regression Feature Importance:")
display(log_importance.head(10))
# Decision Tree Feature Importance
tree_importance = pd.DataFrame({
"feature": X.columns,
"importance": tree.feature_importances_
}).sort_values(by="importance", ascending=False)
print("\nDecision Tree Feature Importance:")
display(tree_importance.head(10))
# Random Forest Feature Importance
rf_importance = pd.DataFrame({
"feature": X.columns,
"importance": rf.feature_importances_
}).sort_values(by="importance", ascending=False)
print("\nRandom Forest Feature Importance:")
display(rf_importance.head(10))
============================
MODEL: Logistic Regression
============================
Confusion Matrix:
[[2564 14]
[ 6 4290]]
============================
MODEL: Decision Tree
============================
Confusion Matrix:
[[2578 0]
[ 0 4296]]
============================
MODEL: Random Forest
============================
Confusion Matrix:
[[2578 0]
[ 0 4296]]
Logistic Regression Feature Importance:
| | feature | importance |
|---|---|---|
| 14 | syear | 3.182438 |
| 13 | smonth | 0.488385 |
| 22 | quality_C | 0.141556 |
| 35 | arcstyle_MIDDLE UNIT | 0.050482 |
| 45 | qualified_U | 0.038630 |
| 40 | arcstyle_TRI-LEVEL | 0.034966 |
| 8 | numbdrm | 0.034715 |
| 29 | gartype_None | 0.029138 |
| 37 | arcstyle_ONE-STORY | 0.027118 |
| 18 | condition_Good | 0.021895 |
Decision Tree Feature Importance:
| | feature | importance |
|---|---|---|
| 4 | yrbuilt | 1.0 |
| 0 | abstrprd | 0.0 |
| 2 | finbsmnt | 0.0 |
| 1 | livearea | 0.0 |
| 3 | basement | 0.0 |
| 5 | totunits | 0.0 |
| 6 | stories | 0.0 |
| 7 | nocars | 0.0 |
| 8 | numbdrm | 0.0 |
| 9 | numbaths | 0.0 |
Random Forest Feature Importance:
| | feature | importance |
|---|---|---|
| 4 | yrbuilt | 0.523443 |
| 37 | arcstyle_ONE-STORY | 0.054863 |
| 6 | stories | 0.048633 |
| 9 | numbaths | 0.043491 |
| 25 | gartype_Att | 0.033880 |
| 1 | livearea | 0.033730 |
| 22 | quality_C | 0.026385 |
| 0 | abstrprd | 0.021284 |
| 21 | quality_B | 0.020934 |
| 10 | sprice | 0.019039 |
Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.
After joining the neighborhood data to the main dataset, all three models improved slightly thanks to the added neighborhood features. Logistic Regression still performed well with only a few errors. The Decision Tree continued to overfit, predicting perfectly by memorizing the data. Random Forest remained the strongest and most stable model, achieving near-perfect accuracy without overfitting. Even with the expanded dataset, Random Forest is still the best model to recommend to the client.
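When joining on parcel it is worth verifying that the join is one-to-one and that every home matched a neighborhood row; pandas can check both directly. A small sketch with toy frames (the parcel values here are hypothetical):

```python
import pandas as pd

homes = pd.DataFrame({"parcel": ["A1", "B2", "C3"], "livearea": [1200, 1850, 960]})
nbhd = pd.DataFrame({"parcel": ["A1", "B2", "C3"], "nbhd_101": [1, 0, 0]})

# validate= raises an error if parcel is duplicated on either side;
# indicator= adds a _merge column flagging unmatched rows
merged = homes.merge(nbhd, on="parcel", how="left",
                     validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())
assert (merged["_merge"] == "both").all()  # every home found its neighborhood row
```

Running the same checks on the real merge would catch duplicate parcels or homes with missing neighborhood data before training.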
# Include and execute your code here
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# ----------------------------------------------------------
# 1. LOAD BOTH DATASETS FROM GITHUB
# ----------------------------------------------------------
df1 = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv")
df2 = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")
# Merge on parcel
df_merged = df1.merge(df2, on="parcel", how="left")
# ----------------------------------------------------------
# 2. PREPARE FEATURES AND TARGET
# ----------------------------------------------------------
X = df_merged.drop(columns=["before1980", "parcel"])
y = df_merged["before1980"]
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# ----------------------------------------------------------
# 3. DEFINE MODELS (Same as Stretch Question 1)
# ----------------------------------------------------------
log_reg = LogisticRegression(max_iter=2000)
tree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=300, random_state=42)
models = {
"Logistic Regression": log_reg,
"Decision Tree": tree,
"Random Forest": rf
}
# ----------------------------------------------------------
# 4. TRAIN MODELS + CONFUSION MATRIX
# ----------------------------------------------------------
for name, model in models.items():
    print("\n============================")
    print(f"MODEL: {name}")
    print("============================")
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Confusion matrix
    cm = confusion_matrix(y_test, preds)
    print("Confusion Matrix:\n", cm)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap="Blues")
    plt.title(f"{name} - Confusion Matrix")
    plt.show()
# Logistic Regression (coefficients)
log_imp = pd.DataFrame({
"feature": X.columns,
"importance": log_reg.coef_[0]
}).sort_values(by="importance", ascending=False)
print("\nLogistic Regression Feature Importance (Top 15):")
display(log_imp.head(15))
# Decision Tree
tree_imp = pd.DataFrame({
"feature": X.columns,
"importance": tree.feature_importances_
}).sort_values(by="importance", ascending=False)
print("\nDecision Tree Feature Importance (Top 15):")
display(tree_imp.head(15))
# Random Forest
rf_imp = pd.DataFrame({
"feature": X.columns,
"importance": rf.feature_importances_
}).sort_values(by="importance", ascending=False)
print("\nRandom Forest Feature Importance (Top 15):")
display(rf_imp.head(15))
============================
MODEL: Logistic Regression
============================
Confusion Matrix:
[[3189 7]
[ 15 5178]]
============================
MODEL: Decision Tree
============================
Confusion Matrix:
[[3196 0]
[ 0 5193]]
============================
MODEL: Random Forest
============================
Confusion Matrix:
[[3195 1]
[ 0 5193]]
Logistic Regression Feature Importance (Top 15):
| | feature | importance |
|---|---|---|
| 15 | syear | 3.377513 |
| 14 | smonth | 0.208560 |
| 38 | arcstyle_ONE-STORY | 0.038318 |
| 23 | quality_C | 0.037784 |
| 8 | numbdrm | 0.030064 |
| 19 | condition_Good | 0.029872 |
| 29 | gartype_Det | 0.025867 |
| 47 | status_I | 0.014244 |
| 230 | nbhd_624 | 0.011063 |
| 46 | qualified_U | 0.008451 |
| 37 | arcstyle_ONE AND HALF-STORY | 0.007799 |
| 5 | totunits | 0.007620 |
| 30 | gartype_None | 0.006177 |
| 270 | nbhd_668 | 0.005462 |
| 41 | arcstyle_TRI-LEVEL | 0.003858 |
Decision Tree Feature Importance (Top 15):
| | feature | importance |
|---|---|---|
| 4 | yrbuilt | 1.0 |
| 1 | livearea | 0.0 |
| 258 | nbhd_655 | 0.0 |
| 259 | nbhd_656 | 0.0 |
| 260 | nbhd_657 | 0.0 |
| 261 | nbhd_658 | 0.0 |
| 262 | nbhd_659 | 0.0 |
| 263 | nbhd_660 | 0.0 |
| 264 | nbhd_661 | 0.0 |
| 9 | numbaths | 0.0 |
| 266 | nbhd_664 | 0.0 |
| 267 | nbhd_665 | 0.0 |
| 268 | nbhd_666 | 0.0 |
| 269 | nbhd_667 | 0.0 |
| 270 | nbhd_668 | 0.0 |
Random Forest Feature Importance (Top 15):
| | feature | importance |
|---|---|---|
| 4 | yrbuilt | 0.370957 |
| 6 | stories | 0.051141 |
| 38 | arcstyle_ONE-STORY | 0.046712 |
| 9 | numbaths | 0.038721 |
| 1 | livearea | 0.033096 |
| 26 | gartype_Att | 0.031679 |
| 44 | arcstyle_TWO-STORY | 0.028143 |
| 23 | quality_C | 0.025719 |
| 54 | nbhd_101 | 0.021712 |
| 10 | sprice | 0.020271 |
| 0 | abstrprd | 0.020103 |
| 12 | netprice | 0.019715 |
| 13 | tasp | 0.019456 |
| 22 | quality_B | 0.017984 |
| 16 | condition_AVG | 0.017641 |
Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.
Yes. I trained a Random Forest regressor on the merged dataset to predict yrbuilt, evaluating it with MAE, RMSE, and R². The model predicts the build year within about 4.7 years on average (MAE), with an RMSE of about 9.2 years and an R² of 0.94, meaning it explains roughly 94% of the variance in build year. One caveat from the feature importances: before1980 (importance ≈ 0.69) is itself derived from yrbuilt, so much of this performance comes from leakage; dropping it would give a more honest estimate.
# Include and execute your code here
# ----------------------------------------------------------
# STRETCH QUESTION | TASK 3 - Predicting Year Built
# ----------------------------------------------------------
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Use merged dataset from previous task (or replace with df)
df = df_merged.copy()
# ----------------------------------------------------------
# 1. Prepare Features and Target
# ----------------------------------------------------------
X = df.drop(columns=["yrbuilt", "parcel"])  # features (note: before1980 remains, and it is derived from yrbuilt)
y = df["yrbuilt"] # target variable
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# ----------------------------------------------------------
# 2. Define and Train Regression Model
# ----------------------------------------------------------
rf_reg = RandomForestRegressor(
n_estimators=300,
random_state=42
)
rf_reg.fit(X_train, y_train)
# ----------------------------------------------------------
# 3. Predictions and Evaluation Metrics
# ----------------------------------------------------------
y_pred = rf_reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("===== YEAR BUILT REGRESSION RESULTS =====")
print(f"MAE (Mean Absolute Error): {mae:.2f} years")
print(f"RMSE (Root Mean Squared Error): {rmse:.2f} years")
print(f"R² Score: {r2:.4f}")
# ----------------------------------------------------------
# 4. Feature Importance (Top 15)
# ----------------------------------------------------------
importances = pd.DataFrame({
"feature": X.columns,
"importance": rf_reg.feature_importances_
}).sort_values(by="importance", ascending=False)
print("\nTop 15 Important Features:")
display(importances.head(15))

===== YEAR BUILT REGRESSION RESULTS =====
MAE (Mean Absolute Error): 4.67 years
RMSE (Root Mean Squared Error): 9.19 years
R² Score: 0.9395
Top 15 Important Features:
| feature | importance | |
|---|---|---|
| 48 | before1980 | 0.690041 |
| 28 | gartype_Det | 0.050045 |
| 3 | basement | 0.034363 |
| 5 | stories | 0.030446 |
| 25 | gartype_Att | 0.025057 |
| 1 | livearea | 0.014159 |
| 0 | abstrprd | 0.013484 |
| 12 | tasp | 0.013339 |
| 11 | netprice | 0.008824 |
| 2 | finbsmnt | 0.008728 |
| 6 | nocars | 0.006071 |
| 22 | quality_C | 0.005923 |
| 7 | numbdrm | 0.005827 |
| 9 | sprice | 0.005387 |
| 108 | nbhd_230 | 0.004204 |